Providing advanced genomic solutions
Customer Service: 0086-10-82837801 ext 849
Email: oversea_support@novogene.com


Novogene Co., Ltd


Contract information Contract content
Project number XXXXXXXX
Project name XXXXXXXXX
Report time 2023-08-17

1 Introduction


  Microbial populations exist in almost every ecological community in the earth. From the insect gut to the oceans, and can also be found in the sediment beneath.

  For a long time, microbial ecologists were mostly restricted to pure cultures of cultivable isolates to shed light on the diversity and functions of environmental microbes. Cultivability of environmental microbes often ranges below 1% of the total bacteria, limiting the research of microbial diversity.

  The metatranscriptomics starts with the metagenomics, which study the whole genome transcription and transcriptional regulation of colonies come from a particular environment and period on the overall level. It regards the all RNA as a research object, which not only avoid the problem of microbial culture and isolation and also expand effectively the use of microbial resources. In 2006, Leiniger and other people studied firstly metatranscriptions of complex microbial community. Compared with metagenomics, the metatranscriptomics can not only study the changes of complex microbial community, but also can better explore new genes.

  In the recent years, with the rapid development of sequencing and information technology, the magnanimous biological data and abundant microorganisms research information can be obtained by the Next Generation Sequencing, which is a very important technology to study the microbial diversity and community characteristics. The HMP (HMP, Human Microbiome Project, http://www.hmpdacc.org/), and EMP (EMP, Earth Microbiome Project, http://www.earthmicrobiome.org/) were mainly employing the high throughput sequencing technology.


Novogene Co., Ltd


2 Library Preparation and Sequencing


  Samples are collected from environment (such as soil, sea, fresh water, intestinal tract and so on) and further treated, and then sent to our company in appropriate form (such as the original samples, the extracted RNA, etc.). When we accepted the samples, we would conduct the necessary pre-experimental treatment and the strict quality control.

  The qualified samples were treated with fragments selection, library preparation and detection in order to obtain the qualified libraries. Then the libraries were sequenced using Illunima to gain raw data for the following information analysis.


Novogene Co., Ltd


2.1 Total RNA Sample QC


  All samples need to pass through the following four steps before library construction:

  1. Agarose Gel Electrophoresis: tests RNA degradation and potential contamination
  2. Nanodrop: tests RNA purity (OD260/OD280)
  3. Qubit: quantifies the RNA (determines concentration)
  4. Agilent 2100: checks RNA integrity


Novogene Co., Ltd


2.2 Library Construction and Quality Assessment

2.2.1 Library Construction


  From RNA samples extraction to final data obtained, every step of procedures containing samples testing, library construction and sequencing would directly affect the quality and quantity of data, which determines the bioinfomatic analysis results. Novogene promises to strictly control on all production data with high quality output and ensure the raw data accurate and reliable. The workflow is as follows:

  Adaptor: P5/P7 is PCR primers and those primers are complementary to sequences on flow cell; Rd1/Rd2 SP are read1/read2 sequencing primers; Index is used for identifying different libraries.

2.2.2 Library QC

  Library concentration was first quantified using a Qubit 2.0 fluorometer (Life Technologies), and then diluted to 1 ng/µl before checking insert size on an Agilent 2100 and quantifying to greater accuracy by quantitative PCR (Q-PCR) (library activity >2 nM).


Novogene Co., Ltd


2.3 Sequencing


  When the quality meets the requirements, libraries are fed into Illumina machines according to activity and expected data volume.The basic theory is sequencing by synthesis. Sequencing reagents (four types of dNTPs labeled by different fluorescences, DNA polymerases, and primers) are added into the flowcell for amplification and cluster generation. By detecting fluorescence from cluster extending complementary strand, the software turns light signals into sequencing peaks, thus the sequence information is acquired. The workflow chart is as follows:


Novogene Co., Ltd


3 Analysis Workflow


  Clean data is acquired by removing low quality reads from raw data for analysis in order to ensure the accuracy and reliability of results.

  First the reads will be assembled to conduct species classification analysis, and gene expression abundance analysis based on the effective data.Then we will perform KEGG analysis, homologous gene cluster analysis (eggNOG), carbohydrate enzyme analysis (CAZy) and other functional annotations in order to understand comprehensively the function information of microorganism.

  At last, we can carry out the comparative analysis between samples, such as cluster analysis, PCoA analysis (just for three or more samples) and so on to dig out the functional difference between samples.


Novogene Co., Ltd


4 Data Quality Control

4.1 Raw Data


  The original raw data from Illumina are transformed to Sequenced Reads by base calling. Raw data are recorded in a FASTQ file, which contains sequence information (reads) and corresponding sequencing quality information.

  Every read in FASTQ format is stored in four lines as follows:

@ST-E00310:278:HF3GJALXX:5:1101:6745:1924 1:N:0:ACGCTCGA  
TTTGGGCCCTTGGCAATGAATGTTGCCACCACTGTTCTGGGTGCAGAGGGGAAATGG
+  
AA&lt-FJFJ7J&ltJJFFJJFJJJ&ltFJJJJJJJJFJJJJFJJJJJAFJ7FJJJJ  
@ST-E00310:278:HF3GJALXX:5:1101:6908:1924 1:N:0:ACGCTCGA  

Line 1 begins with a ‘@’ character and is followed by the Illumina Sequence Identifiers and an optional description.

Line 2 is the raw sequence read.

Line 3 begins with a ‘+’ character and is optionally followed by the same sequence identifier and description.

Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of characters as there are bases in the sequence (Cock et al.).


Novogene Co., Ltd


4.2 Data Filtering


  Raw reads are filtered to remove reads containing adapters or reads of low quality, so that downstream analyses are based on clean reads.

The filtering process is as follows:

(1) Discard reads with adaptor contamination.
(2) Discard reads when uncertain nucleotides constitute more than 10 percents of either read (N > 10%).
(3) Discard reads when low quality nucleotides (base quality less than 20) constitute more than 50 percents of the read.

  Results are shown as percentage of total raw reads as follows:

The colors and scales represent percentage of different components:

  1. Adapter related: reads that had adapter contamination.

  2. Containing N: reads in which uncertain nucleotides constituted more than 10 percents of the read.

  3. Low quality: reads in which low quality nucleotides constituted more than 50 percents of the read.

  4. Clean reads: reads with eligible quality.


Novogene Co., Ltd


4.3 Error Rate Distribution


  The error rate for each base can be transformed by the Phred score as in equation 1(equation 1: Qphred = -10log10(e)).The relationship between Phred quality scores Q and base-calling error “e” is given below: Base Quality and Phred score relationship with the Illumina CASAVA v1.8 software:

Phred score Base Calling error rate Base Calling correct rate Q-sorce
10 1/10 90% Q10
20 1/100 99% Q20
30 1/1000 99.9% Q30
40 1/10000 99.99% Q40

  Sequencing error rate distribution has two main features: (1)Error rate increases as the sequencing reads are extended and sequencing reagents become more and more scarce. (2)the first six bases have a relatively high error rate due to the random hexamers used in priming cDNA synthesis (Jiang et al.).


Novogene Co., Ltd


4.4 A/T/G/C Content Distribution


  AT/GC content distribution is evaluated to detect potential AT/GC separation, which affects subsequent gene expression quantification. In view of the random interrupted and the principle of G/C, A/T content are respectively equal, G and C, A and T should be equal and content is stable on the whole through the whole sequencing process for non-stranded library. Theoretically, G should equal C, and A should equal T throughout the whole sequencing process for non-stranded libraries, whereas AT/GC separation is normally observed in stranded libraries. For NEB libraries, a large variation of sequencing error in the first 6-7 bases is allowed due to the use of random primers in library construction. If the library is strand-specific, it may occur AT separation or GC separation.


Novogene Co., Ltd


4.5 Data Quality Control Summary


sample raw_reads clean_reads clean_bases error_rate Q20 Q30 GC_pct
ten_ug1 4165525 4110932 1.23G 0.02 98.46 95.24 47.74
ten_ug2 4133675 3986768 1.2G 0.02 98.45 95.24 47.53
ten_ug3 4149225 4077538 1.22G 0.02 98.38 95.07 46.97
control1 4136875 4033364 1.21G 0.02 98.55 95.48 46.80
control2 4139550 4033271 1.21G 0.02 98.30 95.13 51.91
control3 4167275 4089642 1.23G 0.03 97.46 92.69 42.96
one_00ng_1 4132600 4053201 1.22G 0.02 98.40 95.12 46.41
one_00ng_2 4154275 4102782 1.23G 0.02 98.44 95.16 46.29
one_00ng_3 4134650 4063720 1.22G 0.02 98.45 95.20 47.17

(1)Sample: the names of samples
(2)raw.Reads: the original sequencing reads counts
(3)clean.Reads: number of reads after filtering
(4)clean.Bases: clean reads number multiply read length, saved in G unit
(5)error.rate: average sequencing error rate, which is calculated by Qphred=-10log10(e)
(6)Q20: percentages of bases whose correct base recognition rates are greater than 99% in total bases
(7)Q30: percentages of bases whose correct base recognition rates are greater than 99.9% in total bases
(8)GC.pct: percentages of G and C in total bases


Novogene Co., Ltd


4.6 Data Quality Control Q&A


Q: The error rate will be higher in longer sequence length, what is the acceptable range of error rate?

A:Novogene set a high requirement for sequencing quality. Generally, error rate of single base should be lower than 1%. For special case, the maximum can be as high as 6%.

Q: What are the quality control criteria in Novogene?

 A:To guaranteeing the quality of bioinformatic analysis, Novogene will set serveral criteria for clean data:
 (1) Removing the reads with adapter;
 (2) Removing the reads with more than 10% N (N represent the base with uncertainty);
 (3) Removing low quality reads(reads with more than 50% low quality base, Quality value sQ <= 20).

Terminology

Adapter:A necessary part for illumina sequencing. Introduced in library preparation and could be recognized by fixed adapters in the flow cell.
Index:Used in sample mixing, it is added to different samples and can be used to identify samples in a mixture.
Q20,Q30:The percentage of bases with Phred greater than 20, 30. Phred=-10log10(e).
raw data/raw reads:Raw data from sequencing platform.
Clean data/Clean reads:The filtered data, which is processed by removing low quality reads. All the bioinformatic analysis is based on clean reads.


Novogene Co., Ltd


5 Transcriptome Reconstruction


  For clean reads from preprocessing, rRNA, tRNA, and SILVA databases from NCBI are used to map and isolate rRNA reads from metatranscriptome. The left reads are assembled with Trinity software, then integrate the reads and remove redundance with CORSET software. Thus the unigene sets are obtained.

5.1 Transcriptome Reconstruction


  Trinity is developed by Broad Institute and Hebrew University of Jerusalem. It is a professional transcriptome assembler software (comprising modules entitled Inchworm, Chrysalis and Butterfly). The workflow of Trinity is as follows:

  • Inchworm:Constructs a k-mer dictionary from all sequenced reads (in practice, k = 25), selects the most frequent seeding k-mer in the dictionary and extends the seed in each direction to form a contig assembly.
  • Chrysalis:Chrysalis clusters minimally overlapping Inchworm contigs into sets of connected components, and constructs complete de Bruijn graphs for each component. Each component defines a collection of Inchworm contigs that are likely to be derived from alternative splice forms or closely related paralogs.
  • Butterfly:Butterfly reconstructs plausible, full-length, linear transcripts by reconciling the individual de Bruijn graphs generated by Chrysalis with the original reads and paired ends. It reconstructs distinct transcripts for splice isoforms and paralogous genes, and resolves ambiguities stemming from errors or from sequences >k bases long that are shared between transcripts. The final assembled result file: TRINITY.fasta.

  The sequencing data of assembled transcriptome is recorded under FASTA format as follows:

>c8_g1_i1 len=256 path=[234:0-255]
TGGAGTTATTCATGAGAATAGTTTCGAGGTTCTTGAGGGTGTCTTGTGCAGGATTTCCAA
GAATTCTGAGGTCAAAGGTGGCATTATTAAGGCCAGGAAGAGTGAAGGACTTATCAGAGC
CAGTGATATTGAGTCTTCTGAGAAGGGGATATCTCCAATCGTTAATTTGAAGGACGAATC
TGTCGTACTCCTTGAAATAAAGGACGTTGAGAGGTCCCATAGCAGCGAAAATGGGCTTGT
TGCCTCTCATTTTTTG
>c12_g1_i1 len=245 path=[136:0-105 400:106-192 242:193-244]
CGGGCCGCCCCTCGTACTGCTTGAACGTCTCGACATAACGGGCCACCTCCTTATAGCTCC
ACGCTAGAATGAGTGTAACGTTGAGACCGAGACACTGCTGAGACAGCTGCAGAGGGTCGG
TCGGTGCGAGTAACACAAGAGAGTGGGTCAAGTTAAGCGATCTCTGCCTGCGAGTATACA
ACACTCCACGCACCTCGCGTATGGCAGCCGCCGAGTCATCCGTCGAGTCTACATGACAGA
GCACG

  Line 1 starts with ‘>’ character and followed by the id number of the transcript;‘len=’ shows the length of the transcript, which is the base number of the transcript; ‘path’ includes the pathway information from de Bruijn Graph subComponet. From line 2 to the end encodes the sequence information of the transcript. More detailed explanation could be found from Trinity’s website http://trinityrnaseq.github.io.

5.2 Transcript Redundancy Removal and Length Distribution


  (1) Redundancy Removal: The basic processing approach is to sort the transcript sequence by length. Then start process the transcript sequence from the one with highest-length. If a new transcript sequence has higher similarity than cutoff value to the known representative transcript sequence of a particular transcript sequence cluster, the new read is classified into the cluster. Otherwise a new cluster including that new read is generated. (CORSET website: https://github.com/Oshlack/Corset/wiki)

  (2) Length Distribution: The longest transcripts of each cluster will be selected as unigenes. Length distribution information of transcripts and unigenes are listed in the following tables:

  Overview of the number of transcripts and unigenes in different length intervals:

Length_interval 200bp-500bp 500bp-1kbp 1kb-2kbp >2kbp Total
Number of transcripts 260904 102508 41256 21589 426257
Number of Unigenes 102898 82297 30991 15221 231407

  Overview of the length distribution of transcripts and unigenes:

Min_length Mean_length Median_length Max_length N50 N90 Total_nucleotides
Transcripts 251 709 417 44848 911 312 302007502
Genes 251 855 538 44848 1050 402 197902823
  • N50/N90: Ranking all transcripts by its length from high to low. Adding the length one by one from the first. When the sum is greater than 50%/90% of the total length, the last added sequence length is N50/N90.

  Length distribution of transcripts and unigene:


Novogene Co., Ltd


5.3 Transcriptome Reconstruction Q&A


Q: What are the main parameters used in reconstruction?

A: The main parameters involved in reconstruction:

  • –SS_lib_type: The reconstruction parameter of strand specific library, used to distinguish the strand information of reads and enhance the accuracy of reconstruction.

  • –min_kmer_cov: Minimum kmer coverage, default as 2 by Novogene. This parameter will be used in k-mer filtering. If k-mer’s coverage minor than 2, it will be recognized as a low quality k-mer (Probably, this k-mer was caused by sequencing mistakes) and was discarded in reconstruction. If you focus on genes with low abundance, min_kmer_cov can be set as 1.

  • –min_glue: When contig aggregated to form components, supporting reads was needed: There is n reads spanning the junction area, and expanding (k-1)/2 bases(k is k-mer’s k value, k=25 in TRINITY, which means it expands 12bp in both sides) match. The min_glue is reads number n. In this case, if higher reconstruction quality is desired with reducing the number of produced reconstructed gene, we can modulate this parameter to achieve the goal.

Terminology

K-mer:When you do the continuous splicing on read with moving size k window, you will get a k length sequence each time, k-mer are those sequence in length k. For example, read: ATCGTT, k is 3, you will get following k-mer: ATC,TCG,CGT,CTT. TRINITY k is 25.

K-mer Dictionary:Used to store k-mer and its coverage, “key” is similar to the index in the real dictionary, “value” is similar to the description in real dictionary, in this case, “key” represent k-mer, “value” is k-mer’s coverage, “key” is unique, and one key has one unique value, thus, each k-mer has its coverage.

Contig:In TRINITY reconstruction, in Inchworm step we will obtain the contig. According to TRINITY algorithm, different contig can not share more than k sequences, therefore, those Inchworm contigs cannot well present different alternative splicings and alleles, which need more process.

Components:In TRINITY reconstruction, the contigs from Inchworm could be clustered and aggregated by its overlapping relationship, and form the components. They are a set of potential alternative splicing isoform or potential allele representation.

De Bruijn graph:In TRINITY reconstruction, in (chrysalis) step, components will be used to form the de Bruijn graph through overlapping relationship, which is used to acquiring the final result sequence and alternative splicing sequence.

Unigene:The longest transcript of each gene, was called as unigene, which is a basis for following analysis.

N50/N90:Sort and list the spliced transcripts by length (from longest to shortest), add the lengths until the accumulated length reaches 50%(90%) of the total length, the threshold transcript length is defined as N50(N90).


Novogene Co., Ltd


6 Species Annotation


6.1 Species Annotation step

  • Alignment of Unigenes with Bacteria, Fungi, Archaea, and Viruses sequences extracted from NCBI’s NR database using DIAMOND software [27] (blastp, evalue ≤ 1e-5);
  • Alignment result filtering: For the alignment result of each sequence, select the Alignment result of evalue <= minimum evalue*10[15] for subsequent analysis. Since multiplealignment results may occur after filteration of each sequence, several differrent species classification result will be obtained. To ensure the biological significance, LCA algorithm (for systematic classification of MEGAN [28] software) is applied, the classification level before the first branch is thought as species annotation of the sequence.
  • From the LCA annotation results and the gene abundance table, obtain the abundance information of each sample at each classification level(taxonomic rank) from the LCA annotation results and the gene abundance table. Abundance of a species in a sample, is equal to the sum of abundance of genes that are annotated to that species [13,14,18];
  • Obtain the gene number information of each sample at each classification level (taxonomic rank) from the LCA annotation results and the gene abundance table. Gene number of one species in a sample is equal to number of genes that are annotated to that species with abundance unequal to 0.
  • The statistical maps and abundance cluster heat maps at different classification levels are drawn from the abundance tables at each classification level(taxonomic rank).

6.2 Relative species abundance statistics

  Select the top 10 species with the highest relative abundance according to the relative abundance of different classification level. Set the remaining species as ‘Others’, Plot statistical graphs of each sample’s species annotation.

  • Note:k10.dis, p10.dis, c10.dis, o10.dis, f10.dis, g10.dis and s10.dis means: Kingdom, Phylum, Class, Order, Family, Genus, Species. There are the top 10 species statistics of the seven largest classification levels.

6.3 Species abundance clustering

  The dominant 35 genera within each sample are selected based on the results of species annotation and abundance information, and then clustered by its taxonomy information and the inter-sample differences among samples.

  • k10.cluster, p10.cluster, c10.cluster, o10.cluster, f10.cluster, g10.cluster and s10.cluster means Kingdom, Phylum, Class, Order, Family, Genus, Species in seven largest classification levels.
  • Clustering tree based on Bray-Curtis distance Description: In the above figure, the Bray-Curtis distance clustering tree structure is on the left side; the relative abundance distribution map of each sample at the door level is shown on the right side.

6.4 Cluster Analysis of Samples Based on Species Abundance

  In order to study the similarity of different samples, a clustering tree of samples can also be constructed by clustering the samples. Bray-Curtis distance is the most commonly used distance index in system clustering. It is mainly used to describe the similarity between samples. Its size is the main basis for sample classification.

  According to the gene abundance of each sample, the Bray-Curtis distance matrix was used for cluster analysis between samples, and the clustering results were integrated with the relative abundance of the species at the phylum level.

6.5 Metastat analysis of species differences

  Differential analysis of the composition spectrum of species annotations can examine the magnitude of differences between samples at the species composition level. Compare the abundance differences of each taxonomic unit between samples based on the composition abundance of each sample at each classification level, and evaluate whether the differences are significant through statistical testing. To explore species with significant differences between groups, the Metastats (integrated non parametric multiple tests and p-value correction) method was used to test the hypothesis of species abundance data between groups and obtain p-values. The p-values were corrected to obtain q-values; Finally, select species with significant differences based on the q value[9,10].

Taxa mean.group1. variance.group1. standard.error.group1. mean.group2. variance.group2. standard.error.group2. p.value q.value
k__Archaea;p__Candidatus Thermoplasmatota;c__Thermoplasmata;o__Methanomassiliicoccales;f__Methanomassiliicoccaceae;g__Methanomassiliicoccus;s__Candidatus Methanomassiliicoccus intestinalis 3.0e-07 0 1e-07 2.3e-06 0 2.1e-06 0.5318826 1
k__Archaea;p__Candidatus Thermoplasmatota;c__Thermoplasmata;o__Methanomassiliicoccales;f__Unclassified;g__Unclassified;s__Methanomassiliicoccales archaeon 3.0e-07 0 3e-07 0.0e+00 0 0.0e+00 1.0000000 1
k__Archaea;p__Candidatus Thermoplasmatota;c__Thermoplasmata;o__Unclassified;f__Unclassified;g__Unclassified;s__methanogenic archaeon ISO4-H5 1.1e-06 0 8e-07 1.0e-06 0 7.0e-07 0.9622373 1
k__Bacteria;p__Acidobacteriota;c__Holophagae;o__Acanthopleuribacterales;f__Acanthopleuribacteraceae;g__Acanthopleuribacter;s__Acanthopleuribacter pedis 0.0e+00 0 0e+00 4.5e-06 0 4.5e-06 0.4791545 1

6.6 Tree of species differences

  Due to differences between groups, species may have potential roles in specific physiological states or environments. Therefore, in order to fully explore the species information of inter group differences, a species distribution tree diagram is constructed.

  Note:1)Color: level
     2)Circle size: relative content rate
     3)Percentage: percentage of the whole, percentage of the previous level


Novogene Co., Ltd


7 Functional Annotation


7.1 Functional Annotation Database Introduction


  To achieve comprehensive gene functional annotation, four databases (GO, KEGG, CAZy and eggNOG) are applied by Novogene. The function and characteristics of the four databases are as follows:

database Introduction details Software.and.parameters
GO Gene Ontology, GO is the established standard for the functional annotation of gene products. GO vocabulary is a controlled vocabulary used to classify the following functional attributes of gene products: Biological Process (BP), Molecular Function (MF) and Cellular Component (CC). GO term is the basic unit of GO system. Each term has a unique identifier. The relationship between the GO term of each ontology can form a Directed Acyclic Topology. More details could be found from the website: http://www.geneontology.org/. Base on the result of NR and Pfam protein annotation,use Blast2GO v2.5(Götz et al., 2008)and Novogene script,e-value=1e-6
KEGG Kyoto Encyclopedia of Genes and Genomes KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies. It contains KEGG PATHWAY, KEGG DRUG, KEGG DISEASE, KEGG MODULE, KEGG GENES, KEGG GENOME etc. And KO system (KEGG ORTHOLOG) combines each KEGG annotation system. KEGG has established a complete KO annotation system which can accomplish the function annotation of the genome/transcriptome of a newly sequenced species. More details could be found from the following website: http://www.genome.jp/kegg/ . Diamond v2.1.6, e-value=1e-5, –more-sensitive
CAZy Carbohydrate-Active Enzymes Database The CAZy database describes the families of structurally-related catalytic and carbohydrate-binding modules (or functional domains) of enzymes that degrade, modify, or create glycosidic bonds.Mainly covers 6 major functional categories:Glycoside Hydrolases (GHs),Glycosyl ransferases(GTs),Polysaccharide Lyases(PLs),Carbohydrate sterases(CEs),Auxiliary Activities(AAs),and Carbohydrate-Binding Modules(CBMs). Diamond v2,1,6, e-value=1e-5, –more-sensitive
eggNOG Evolutionary genealogy of genes: Non-supervised Orthologous Groups eggNOG’s database currently counts 721,801 orthologous groups in 1133 species, covering 4,396,591 proteins (built from 5,214,234 proteins)., e-value=1e-5, –more-sensitive

7.2 Functional Annotation step


  1. Map Unigenes with each functional database using DIAMOND software (blastp,evalue ≤ 1e-5) ;
  2. Alignment result filtering: in order to ensure the biological significance of the subsequent study, the alignment result of each sequence is screened, and the coverage ratio BCR of each gene in the reference and query is calculated ( The BLAST Coverage Ratio ensures that the BCR (reference) and BCR (Que.) in each comparison record is greater than 40%, and then statistically summarizes according to the characteristics of each database, and finally obtains the corresponding function annotation information. The BCR value of the reference and query genes is calculated as follows, where the Match is the effective length of both Alignment, the length (R) is the length of the reference gene, and the length (Q) is the length of the query gene. BCR (Ref.) = (Match/Length(R)) x100%; BCR (Que.) = (Match/Length (Q) x100%; 3)From the Alignment results, the relative abundance of different functional levels is counted (the relative abundance of each functional level is equal to the sum of the relative abundances of the genes annotated as the functional level), wherein the KEGG database is divided into 5 levels, the eggNOG database Divided into 3 levels, the CAZy database is divided into 3 levels, and the detailed division level of each database is as follows:
database.name level description.of.level
KEGG level1 KEGG pathway level1 include 6 pathway database;
KEGG level2 KEGG pathway level2 43 sub-pathway database;
KEGG level3 KEGG pathway id(e.g. map00010);
KEGG KO KEGG ortholog group (e.g. K00010);
KEGG ec KEGG EC Number(e.g. EC 3.4.1.1);
eggNOG level1 24 function taxa;
eggNOG level2 ortholog group description;
eggNOG og ortholog group ID(e.g. ENOG410YU5S);
CAZy level1 6 major function classes;
CAZy level2 CAZy family(e.g. GT51);
CAZy ec EC number(e.g. murein polymerase (EC 2.4.1.129));
  1. Starting from the functional annotation results and the gene abundance table, the number of genes on each classification level of each sample is obtained. The number of genes in a sample is equal to the abundance in the gene annotated as the function isnot 0.
  2. Starting from the abundance table at each classification level, the number of annotated genes is counted, the relative abundance profile and the abundance clustering heatmap and PCA analysis are displayed.


Novogene Co., Ltd


7.3 GO annotation


  GO is the abbreviation of Gene Ontology which is a widely used classification system in bioinformation field. (See http://www.geneontology.org/). With GO annotation, the successfully annotated genes will be grouped into three main GO domains: Biological Process (BP), Cellular Component (CC), Molecular Function (MF). There are many levels under the three branches, higher level contains more detailed annotations(level 1 as the top level). The GO annotation is significant in understanding the biological meaning of genes. After GO annotation by blast2go , the annotated genes are classified by the three classes(BP,CC and MF).


Novogene Co., Ltd


7.4 KEGG annotation


  KEGG (Kyoto Encyclopedia of Genes and Genomes, http://www.kegg.jp/) is an important database resource for pathway analysis. Since tin 1995, it has developed into a comprehensive data base. The key database is KEGGPATHWAY data base. There are 6 types of biological metabolism paths: cellular processes, environmental information processing, genetic information processing, human diseases, metabolism and organismal systems. The genes successfully annotated in KEGG can be classified according to the KEGG pathway they joined in. Each class has second, third and fourth level. The second level contains 43 sub-functions, the third level is metabolism paths, the fourth level is annotaton information for each metabolism path.


7.4.1 alignment result of KEGG


  Starting from the results of the Unigenes annotations, the Kegg_geneID, Ko_id, Ko_name and Ko_class information of the database are statistically alignment.

Query_id Subject_id KO_ID KO_NAME KO_DEFINITION KO_EC Module
TRINITY_DN109070_c0_g1 lxa:OW255_04045
TRINITY_DN293930_c0_g1 esu:EUS_07740 K15635 apgM 2,3-bisphosphoglycerate-independent phosphoglycerate mutase 5.4.2.12 M00001; Pathway module; Carbohydrate and lipid metabolism; Central carbohydrate metabolism; Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate [PATH:map01200 map00010] | M00002; Pathway module; Carbohydrate and lipid metabolism; Central carbohydrate metabolism; Glycolysis, core module involving three-carbon compounds [PATH:map01200 map01230 map00010] | M00003; Pathway module; Carbohydrate and lipid metabolism; Central carbohydrate metabolism; Gluconeogenesis, oxaloacetate => fructose-6P [PATH:map00010 map00020]
TRINITY_DN94779_c0_g1 rus:RBI_I00452 K01740 metY O-acetylhomoserine (thiol)-lyase 2.5.1.49
TRINITY_DN154632_c4_g2 lasa:L9O85_04530

7.4.2 Annotation gene number statistics


   Starting from the Unigenes annotation results, a statistical map of the number of genes annotated to the KEGG database is drawn, showing the results as follows:

  • Y-axis is the names of KEGG pathways; X-axis is the number of the genes annotated in the pathway and the total number of annotated genes. The KEGG metabolic pathways gene involved are divided into 5 branches: Cellular Processes, Environmental Information Processing, Genetic Information Processing, Metabolism, Organismal Systems. ***

7.4.3 Annotation function relative abundance statistics


   Based on the relative abundance table of level 1 in each database, the abundance statistics of each sample at level 1 in each database are drawn.

7.4.4 Annotation function relative abundance analysis


  The dominant 35 genera within each sample are selected based on the results of species annotation and abundance information, and then clustered by its taxonomy information and the inter-sample differences among samples.

7.4.5 Annotation function abundance PCA


  Principal component analysis (PCA) is a statistical procedure, using an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. For the multi Sample, the more similar they are, the closer sample points in the PCA figure could be get. The results which is based on Level2, level3 and KO levels of KEGG function abundance are shown as follows:

7.4.6 Annotation function abundance PCoA


  Principal coordinates analysis (PCoA) is an ordination technique, which picks up the main elements and structure from reduced multi-dimensional data series of eigenvalues and eigenvectors. The technique has the advantage over PCA that each ecological distance can be investigated. Weighted Unifrac and Unweighted Unifrac are calculated to assist the PCoA analysis. The results which is based on Level1 PCoA levels of function abundance are shown as follows:

7.4.7 Annotation function abundance clustering


  According to functional classification results of KEGG, the functional cluster analysis can be performed between multi-samples in order to analyze the similarity on function. The Bray-Curtis distance is the most common index describing similarity of samples, which is the basis of classification. Conduct cluster analysis among samples with Bray-Curtis distance, and display the results along with relative functional abundanceon level 1. The integrated results are shown as follows:


Novogene Co., Ltd


7.5 eggNOG annotation


   eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) is a database of orthologous groups of genes. The orthologous groups are annotated with functional description lines (derived by identifying a common denominator for the genes based on their various annotations), with functional categories (i.e derived from the original COG/KOG categories). eggNOG’s database currently covers about 721,801 orthologous groups in 1133 species, and 62.5% of which has broad annotation information that can be used to eggNOG annotation:

7.5.1 alignment result of eggNOG


  Starting from the Unigenes annotation, database statistics of orthologous genes and other relevant information.

Query_id Subject_id Ortholog_Group Functional_Category OG_Description
TRINITY_DN109070_c0_g1 397287.C807_02602 COG3706 TRUE GGDEF domain
TRINITY_DN109070_c0_g1 397287.C807_02602 COG2199 TRUE diguanylate cyclase activity
TRINITY_DN109070_c0_g1 397287.C807_02602 COG2199 TRUE diguanylate cyclase activity
TRINITY_DN109070_c0_g1 397287.C807_02602 ENOG5027PI4 TRUE cheY-homologous receiver domain

7.5.2 Annotation gene number statistics


   Starting from the results of Unigenes annotations, a statistical map of the number of genes annotated to the eggNOG database is drawn, showing the results as follows:

7.5.3 Annotation of function relative abundance statistics


   Based on the relative abundance table of level 1 in each database, the abundance statistics of each sample at level 1 in each database are drawn.

7.5.4 Relative functional abundance clustering


  The dominant 35 genera within each sample are selected based on the results of species annotation and abundance information, and then clustered by its taxonomy information and the inter-sample differences among samples.

7.5.5 Functional abundance PCA


  Principal component analysis (PCA) is a statistical procedure, using an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. For the multi Sample, the more similar they are, the closer sample points in the PCA figure could be get. The results which is based on Level1, level3 and eggNOG levels of KEGG function abundance are shown as follows:

7.5.6 Functional abundance PCoA


  Principal coordinates analysis (PCoA) is an ordination technique, which picks up the main elements and structure from reduced multi-dimensional data series of eigenvalues and eigenvectors. The technique has the advantage over PCA that each ecological distance can be investigated. Weighted Unifrac and Unweighted Unifrac are calculated to assist the PCoA analysis. The results which is based on Level1 eggNOG levels of function abundance are shown as follows:

7.5.7 Functional abundance clustering of multi-samples


  According to functional classification results of eggnog, the functional cluster analysis can be performed between multi sample in order to analyze the similarity on function, the results are shown as follows:


Novogene Co., Ltd


7.6 CAZy annotation


  The CAZy database describes the families of structurally-related catalytic and carbohydrate-binding modules (or functional domains) of enzymes that degrade, modify, or create glycosidic bonds. Enzyme Classes currently covered: Glycoside Hydrolases (GHs), Glycosyl Transferases (GTs), Polysaccharide Lyases (PLs) , Carbohydrate Esterases (CEs) , Auxiliary Activities (AAs) , Carbohydrate-Binding Modules (CBMs). Map the gene cluster to CAzy database by hmmsean tool of CAzy, the e-value of expected coefficient is 1e-5. After acquiring the carbon hydrate enzyme’s annotation information, the abundance of carbon hydrate enzyme is calculated with total of corresponding gene abundance.

7.6.1 alignment result of CAZy


  Starting from the results of Unigenes annotations, Statistics annotate to the database of carbohydrate enzyme species related information.

Query_id Subject_id CAZy_Family Family_Description
TRINITY_DN109070_c0_g1 CAI2717932.1 CE11 UDP-3-0-acyl N-acetylglucosamine deacetylase (EC 3.5.1.108).
TRINITY_DN134865_c0_g1 BBF44784.1 CBM34 Modules of approx. 120 residues. Granular starch-binding function has been demonstrated in the case of Thermoactinomyces vulgaris R-47 alpha-amylase 1 (TVAI).
TRINITY_DN134865_c0_g1 BBF44784.1 GH13 alpha-amylase (EC 3.2.1.1); pullulanase (EC 3.2.1.41); cyclomaltodextrin glucanotransferase (EC 2.4.1.19); cyclomaltodextrinase (EC 3.2.1.54); trehalose-6-phosphate hydrolase (EC 3.2.1.93); oligo-alpha-glucosidase (EC 3.2.1.10); maltogenic amylase (EC 3.2.1.133); neopullulanase (EC 3.2.1.135); alpha-glucosidase (EC 3.2.1.20); maltotetraose-forming alpha-amylase (EC 3.2.1.60); isoamylase (EC 3.2.1.68); glucodextranase (EC 3.2.1.70); maltohexaose-forming alpha-amylase (EC 3.2.1.98); maltotriose-forming alpha-amylase (EC 3.2.1.116); branching enzyme (EC 2.4.1.18); trehalose synthase (EC 5.4.99.16); 4-alpha-glucanotransferase (EC 2.4.1.25); maltopentaose-forming alpha-amylase (EC 3.2.1.-) ; amylosucrase (EC 2.4.1.4) ; sucrose phosphorylase (EC 2.4.1.7); malto-oligosyltrehalose trehalohydrolase (EC 3.2.1.141); isomaltulose synthase (EC 5.4.99.11); malto-oligosyltrehalose synthase (EC 5.4.99.15); amylo-alpha-1,6-glucosidase (EC 3.2.1.33); alpha-1,4-glucan: phosphate alpha-maltosyltransferase (EC 2.4.99.16); amino acid transporter; [retaining] sucrose 6(F)-phosphate phosphorylase (EC 2.4.1.329); [retaining] glucosylglycerol phosphorylase (EC 2.4.1.359); ; Glucosylglycerate phosphorylase (EC 2.4.1.352); [retaining] sucrose alpha-glucosidase (EC 3.2.1.48); oligosaccharide alpha-4-glucosyltransferase (EC 2.4.1.161); [retaining] alpha-amylase (EC 3.2.1.1);
TRINITY_DN239720_c0_g1 QWU85921.1 GH130 beta-1,4-mannosylglucose phosphorylase (EC 2.4.1.281); beta-1,4-mannooligosaccharide phosphorylase (EC 2.4.1.319); beta-1,4-mannosyl-N-acetyl-glucosamine phosphorylase (EC 2.4.1.320); beta-1,2-mannobiose phosphorylase (EC 2.4.1.339); beta-1,2-oligomannan phosphorylase (EC 2.4.1.340); beta-1,2-mannosidase (EC 3.2.1.-); beta-1,3-oligomannan phosphorylase (EC 2.4.1.-)

7.6.2 Annotation gene number statistics


   Starting from the results of Unigenes annotations, a statistical map of the number of genes annotated to the CAZy database is drawn, showing the results as follows:


7.6.3 Functional relative abundance statistics


   Based on the relative abundance table of level 1 in each database, the abundance statistics of each sample at level 1 in each database are drawn.


7.6.4 Functional relative abundance clustering


  The dominant 35 genera within each sample are selected based on the results of species annotation and abundance information, and then clustered by its taxonomy information and the inter-sample differences among samples.

7.6.5 Functional abundance PCA


  Principal component analysis (PCA) is a statistical procedure, using an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. For the multi Sample, the more similar they are, the closer sample points in the PCA figure could be get.For the multi Sample, the more similar they are, the closer sample points in the PCA figure could be get. The results which is based on Level1, level2 and CAZy levels of function abundance are shown as follows:

7.6.6 Functional abundance PCoA


  Principal coordinates analysis (PCoA) is an ordination technique, which picks up the main elements and structure from reduced multi-dimensional data series of eigenvalues and eigenvectors. The technique has the advantage over PCA that each ecological distance can be investigated. Weighted Unifrac and Unweighted Unifrac are calculated to assist the PCoA analysis. The results which is based on Level1 CAZy levels of function abundance are shown as follows:

7.6.7 Functional abundance clustering of multi samples


  To analyze the similarity of different samples, cluster analysis trees of samples are constructed. It is the basis of sample classification that describes the closeness between samples. According to functional classification results of CAZy, the functional cluster analysis can be performed between multi sample in order to analyze the similarity on function, the results are shown as follows:


Novogene Co., Ltd


7.7 CARD resistance genes annotation

  environmental microbes. The abuse of antibiotics leads to irreversible changes in microbial communities in the human body and the environment, resulting in risks to human health and the ecological environment. Therefore, the study of resistance genes has attracted wide attention of researchers. The Comprehensive Antibiotic Resistance Database (“CARD”) provides data, models, and algorithms relating to the molecular basis of antimicrobial resistance. The CARD provides curated reference sequences and SNPs organized via the Antibiotic Resistance Ontology (“ARO”). e. ***

7.7.1 resistance genes annotation step


1)Unigenes aligns with the CARD database using the Resistance Gene Identifier (RGI) software provided by the CARD database (RGI built-in blastp, default evalue ≤ 1e-30);

2)According to the RGI alignment results, combined with the abundance information of Unigenes, the relative abundance of each ARO is counted;

3)From the abundance of ARO, abundance histogram display, abundance cluster heat map display, abundance distribution circle map display, ARO difference analysis between groups, resistance gene (annotation to ARO unigenes) species attribution analysis, etc. (For AROs with longer partial names, use the first three words and the underline abbreviations)


7.7.2 Survey of resistance gene abundance

  Starting from the relative abundance table of resistance genes, the content and percentage of ARO in each sample were calculated, and the top 20 ARO results of the largest abundance were shown as follows:

  • Figure stat.ARO.ppm represents the relative abundance of all genes in each sample of ARO, in ppm;

  • The graph stat.ARO.RelativePercent represents the relative abundance of top20 ARO in all AROs, and others is the sum of the relative abundances of non-top 20 AROs.


7.7.3 Circos of ARO abundance

  In order to more intuitively observe the abundance ratio of ARO in each sample, it is more intuitive to display the overall distribution of each ARO abundance, and select the ARO of the maximum abundance top10 to draw an overview circle diagram, as shown below:

.

  • The circle diagram is divided into two parts, the sample information on the right and the ARO information on the left. Different colors in the inner circle indicate different samples and ARO, the scale is relative abundance in ppm, the left side is the sum of the relative abundances in each ARO sample, and the right side is the relative abundance of each ARO in a sample. And the left side of the outer ring is the relative percentage of each sample in an ARO, and the right side of the outer ring is the relative percentage of each ARO in a sample.

7.7.4 Distribution and abundance cluster analysis of resistance gene types

  In order to reflect the distribution of each resistance gene type (ARO) in each sample, a black and white heatmap of the ARO distribution was drawn.The dominant 30 genera within each sample are selected based on the results of annotation and abundance information, and then clustered by its ARO information and the inter-sample differences among samples.

  • Figure bw is the heatmap of the ARO distribution, the horizontal axis is the sample name, the right vertical axis is the resistance gene type ARO name, black means the sample contains the ARO, white means the sample does not have the ARO;

  • Figure heat is the ARO abundance cluster heatmap, the right vertical axis is the ARO name, the left vertical axis cluster tree is the ARO cluster tree, and the intermediate heat map corresponds to the value of each row ARO relative abundance after standardization The resulting Z value.


Novogene Co., Ltd


8 Gene Expression Analysis


8.1 Reference Alignment


  Transcriptome reconstructed by Trinity is used as a reference (ref). RSEM (Li et al., 2011) is the software used to do the alignment. The summary of mapping results is shown as follows:

Sample.name Total.reads Total.mapped
ten_ug1 4110932 2760737(67.16%)
ten_ug2 3986768 2404445(60.31%)
ten_ug3 4077538 2413363(59.19%)
control1 4033364 2749575(68.17%)
control2 4033271 1930134(47.86%)
control3 4089642 3208215(78.45%)
one_00ng_1 4053201 2685384(66.25%)
one_00ng_2 4102782 2895393(70.57%)
one_00ng_3 4063720 2636964(64.89%)
  • Sample name: Name of samples
  • Total reads: Quantitative statistics of sequencing sequences after sequencing data filtering (Clean data)
  • Total mapped: Statistics that can be mapped to the number of sequencing sequences on the reference sequence

8.2 Summary of Gene Expression Levels


  To calculate the gene expression level, RSEM analysed the mapping results of Bowtie, and then got the read count for each gene of each sample. Furthermore, the read counts were converted into FPKM value. In RNA-seq, FPKM, short for the expected number of Fragments Per Kilobase of transcript sequence per Millions base pairssequenced, is the commonest method of estimating gene expression levels, which takes into account the effects of both sequencing depth and gene length oncounting of fragments. The results (part of all results) are shown as follows:

table 8-1 read count and fpkm statistical results:
gene_id ten_ug1 ten_ug2 ten_ug3 control1 control2 control3 one_00ng_1 one_00ng_2 one_00ng_3
TRINITY_DN239031_c0_g1 1 2 2 1 12 0 1 0 3
TRINITY_DN153190_c3_g1 0 1 0 0 0 0 1 0 0
TRINITY_DN153190_c3_g2 4 5 0 2 0 2 0 0 1
TRINITY_DN40998_c0_g1 0 4 6 0 1 0 0 0 0

  Gene_id:Gene ID number after CORSET de-redundancy;

  Others: the read count statistics for the gene for each sample.



table 8-2 Fpkm statistics for each sample:
gene_id ten_ug1 ten_ug2 ten_ug3 control1 control2 control3 one_00ng_1 one_00ng_2 one_00ng_3
TRINITY_DN239031_c0_g1 0.56 1.26 1.28 0.55 9.54 0.00 0.57 0 1.73
TRINITY_DN153190_c3_g1 0.00 11.43 0.00 0.00 0.00 0.00 11.62 0 0.00
TRINITY_DN153190_c3_g2 16.49 20.27 0.00 6.79 0.00 6.19 0.00 0 3.80
TRINITY_DN40998_c0_g1 0.00 5.70 8.88 0.00 1.81 0.00 0.00 0 0.00

Gene_id:Gene ID number after CORSET de-redundancy;

Others:Fpkm statistical results of this gene for each sample.


Novogene Co., Ltd


8.3 Gene FPKM density distribution map


  The FPKM density distribution as a whole reflects the gene expression pattern of each sample. The graph is a non-standard normal distribution with a region area of 1, representing a total of 1 probability, and a peak of the density distribution curve representing the highest number of genes at that level. The results are shown in the figure below:

8.4 Comparison of gene expression levels under different experimental conditions


  The density distribution map and box plot of the FPKM of the genes under different experimental conditions were used to examine the distribution of FPKM under different experimental conditions at the overall level. The results of the project are shown in the figure below:

  • FPKM_density_distribution: The abscissa is the log10 (FPKM) value of the gene. The higher the value, the higher the expression level of the gene; the ordinate is the density corresponding to log10 (FPKM), and the different colors represent different samples. To measure the difference between the samples.

  • FPKM_boxplot: FPKM box plot, the abscissa is the sample name, the ordinate is log10 (FPKM+1), and the box plot for each region has five statistics (from top to bottom are the maximum, the upper quartile, median) , lower quartile and minimum), the graph measures total expression the difference between samples in terms of discreteness.

  • FPKM_interval.bar: The abscissa is the sample name, and the ordinate is the fpkm interval and the number of genes. Different colors represent different expression volume intervals, and the graph can display the overall distribution of each sample’s fpkm value.


Novogene Co., Ltd


8.5 Gene Expression Analysis Q&A


Q: What are the differences between FPKM and RPKM?

A: Both FPKM and RPKM are used in gene expression level normalization. FPKM will consider the fragment difference in the two read of paired end read, while RPKM consider reads: rather than using read counts, approximates the relative abundance of transcripts in terms of fragments observed from an RNA-Seq experiment, which may not be represented by a single read, such as in paired-end RNA-Seq experiments.

Q: What is the gene expression threshold? what is the rationale?

A: In bioinformatic analysis without genome reference, we think the gene with FPKM>0.3 is expressed. This threshold is recommend by many prestigious journals and can well reflect the gene expression to large extent. 

Q: In TRINITY reconstruction result, what is the reason the unmapped case appear and the case that readcount and FPKM are 0?

A: The main reason are listed as following: 
1: Trinity won't use all the clean reads. 
2: It is related to the mismatch parameter(TRINITY reconstruction allow more mismatch than RSEM. For this reason, the reconstruction result may not be matched in alignment process) 
3: The mapping rate in table means the percentage of two end mapping. If only one end is mapped, the read is not be counted (For PE sequencing).

Q: Are the transcripts included in gene expression analysis, besides the unigene?

A: We use the software RSEM to do analysis, it is compatible with Trinity and can do the quantification on transcripts as well as unigene. Although Novogene do the quantification on unigene and Transcripts in the same time, the following analysis is based on unigene and the result provides the unigene expression level.

Terminology:

FPKM: expected number of Fragments Per Kilobase of transcript sequence per Millions base pairs sequenced.
RPKM: expected number of reads Per Kilobase of transcript sequence per Millions base pairs sequenced.


Novogene Co., Ltd


9 Gene Expression Difference Analysis


  Read count value obtained from the gene expression analysis is used as the input data to do differential expression analysis.The analysis is mainly divided into three parts:

  (1) First normalize the readcount(normalization);

  (2) Then calculate the hypothesis test probability (pvalue) according to the model;

  (3) Finally, multiple hypothesis test corrections are performed to obtain FDR values (error discovery rate).

  Different software will be used to analyze the differential expression of genes for different situations. The analysis method is as follows:

type software mathod for standardization pvalue FDR parameter for filter
With biological duplicates DESeq2(Anders et al, 2014) DESeq Negative binomial distribution BH |log2(FoldChange)| > 1&qvalue < 0.005
Without biological duplicate edgeR(Robinson MD et al., 2010) TMM Poisson distribution BH |log2(FoldChange)| > 1&qvalue < 0.0005

9.1 Differentially Expressed Genes

gene_id ten_ug_readcount control_readcount log2FoldChange pvalue padj
TRINITY_DN252341_c0_g1 0 1802.78806 -26.444 0 0
TRINITY_DN123362_c0_g1 0 41.88569 -21.451 0 0
TRINITY_DN125455_c0_g1 0 34.06702 -21.163 0 0
TRINITY_DN135982_c0_g1 0 36.85940 -21.259 0 0

(1)gene_id
(2)Sample_readcount:Standardized readcount value for each sample.
(3)log2FoldChange:log2(Sample1/Sample2).
(4)pval:pvalue in statistical tests.
(5)padj:corrected pvalue. The smaller the p-adjusted is, the more significant differentially expressed genes.


Novogene Co., Ltd


9.2 Volcano diagram of differential expression genes


  Volcano plots can be used to infer the overall distribution of differentially expressed genes.

  For experiments with biological replicates, as the DESeq2 already eliminates the biological variation, the default threshold is normally set as: padj < 0.05. Red dots represent up-regulated genes and green dots represent down-regulated genes.


Novogene Co., Ltd


9.3 Venn diagram of differential expression genes


  Venn diagram presents the number of different expressed genes in each group and the overlaps between groups. (Only the diagram of 2,3 and 4 groups are provided). And the coExpression_venn diagram presents the number of genes that are uniquely expressed within each sample, with the overlapping regions showing the number of genes that are expressed in two or more samples.


Novogene Co., Ltd


9.4 Cluster Analysis of differential expression genes


  Cluster Analysis of differential expression genes was used to estimate expression pattern of differential expression genes under different experimental conditions. Through clustering genes in similar expression pattern, unknown functions of transcripts were recognized, with the reason that same kind of transcripts have similar functions or participate in the same metabolic processes or cellular pathways. Hierarchical clustering analysis was carried out with the log10(FPKM+1) of union differential expression genes of all comparison groups under different experimental conditions.

  Three methods: H-cluster, K_means_cluster and SOM_cluster, are applied for the cluster analysis. We firstly calculate log2(FPKM+1) of the expression level of differential genes, and perform clustering after central calibration. The differential expressed genes are devided into several clusters. Genes in one cluser have similar expression level changing trend at different conditions. Expression mode cluster graph of different treatment groups is shown below. The x axis represents sample names, the y axis represents the corrected expression level value.

9.5 Gene Expression Analysis Q&A


Q: How to define differentially expressed genes? How to define the up-regulated gene and down-regulated gene?

A: For biologically replicated sample, DESeq2 is used to do the analysis, the filtering threshold is padj < 0.05. If log2Folderchange of a gene is larger than 0, the gene is considered up-regulated, gene with log2Folderchange < 0 is considered as down-regulated.

Q: Can we use FPKM to do gene expression difference analysis?

A: We use readcount to do gene expression difference analysis, which can be normalized by DESeq2 or TMM. Though FPKM can be applied, it normalization perfomance is worse off than DESeq and TMM. Their performace indices are listed in the table below. Thus we do not recomond FPKM being used in differential anaysis.

Q: What is the largest passible gene filtering threshold? Is it necessary to find related genes by adjusting filtering threshold of differential expressed genes?

A: There are some high-ranking journals that are strict with threshold parameter setting. While some articles have relatively easier policy. For example,qvalue is the only parameter required for differential expressed genes filtering in some non-biological-repication paper, without considering log2foldchange While some other papers require pvalue as a filtering parameter.

Q: How to evaluate the gene expression difference among samples?

A: Significant difference can be detected by corrected pvalue. The Lower the corrected pvalue is, the higher the signicance is. In the meantime, the difference can be evaluated through |log2Foldchange|, higher |log2Foldchange| means larger differences in expression level.

Q: Why is it if one gene has very different expression levels in two samples, wherase it is not in the significantly differential expressed gene list?

A: The selection of differentially expressed gene has statistical difference in absolute significance: 
Firstly: Sequencing depth has some effcts: higher sequencing depth tend to bring more readcount, which should be offseted by normalization. (In project with biological replicates, Using DESeq; Without biological replicates, Using TMM). 
Secondly: In differentially expressed gene analysis, we need to estimate the distribution of readcount, empirically, normally, readcount follows negative binomial distribution. In the project with replicates, the quality of replication will affect the result. If the replication is not some differentially expressed gene will be masked. After estimating the parameter, a statistical examination should be performed. 
Finally: After pvalue calculation, we need to apply a correction on pvalue basing on multiple hypotheses examination to reduce the false positive. Generally, this process will make padj greater than pvalue, making part of gene passed pvalue threshold be filtered by padj.

Q: What is the method for clustering analysis?

A: Clustering uses a build-in R package, pheatmap. We focus on the data (union_for_cluster), which is the overlap of gene set of different samples. And we use relative gene expression level, log2(ratios), to do clustering. The clustering will calculate the distance between each genes, and evaluate the relative gene distance through iteration. Finally, it will divide the genes into several subgroups according to gene distance. H-clusger, K-means and SOM are the main clustering method used by us. Most of them can be implemented in R language and other open source softwares.

Q: Why we use padj, can we use pvalue?

A: Corrected p value (padj/qvalue), is value after applying multiple statistical hypothesis examination on p-value, which is higher than original p-value and more efficient in control false positive.


Novogene Co., Ltd


10 Enrichment Analysis

  The enrichment analysis of differential expressed genes is able to identify their most significantly associated biological function paths under different conditions. Novogene adopt the GOseq and KOBAS softwares to conduct GO enrichment ananlysis and KEGG enrichment analysis respectively. The enrichment analysis is based on the hypergeometric distribution theory. As shown in the picture below, the differential gene set is a list of differential genes from significant analysis, the background gene set is a list of all genes annotated from GO/KEGG enrichment ananlysis. The enrichment ananlysis results are acquired by enrichment on differential gene sets of each differential comparison group. The table of final report displays the enrichment analysis result of the first comparison group filled on the BI form. The picture displayed is the enrichment analysis results of all comparison groups.



Novogene Co., Ltd


10.1 GO Enrichment Analysis of DEGs


10.1.1 GO Enrichment Analysis of DEGs


  Gene Ontology (http://www.geneontology.org/) is a major bioinformatics initiative to unify the presentation of gene and gene product attributes across all species. DEGs refer to differentially expressed genes.

Significantly Enriched GO terms in DEGs:

Category GOID Description GeneRatio BgRatio pvalue padj
MF GO:0003700 DNA-binding transcription factor activity 367|4586 7737|134678 0.0003487 0.042437
CC GO:0005773 vacuole 25|4586 287|134678 0.0006287 0.042437
CC GO:0005622 intracellular 1541|4586 39034|134678 0.0018464 0.081145
CC GO:0005623 cell 1706|4586 43538|134678 0.0024043 0.081145

  1. GO accession:Gene Ontology entry.
  2. Description:Detailed description of Gene Ontology.
  3. Term type:GO types, including cellular_component, biological_process and molecular_function
  4. Over represented p-Value: p-value in hypergenometric test
  5. Corrected P-Value:Corrected P-value; GO with corrected p-value < 0.05 are significantly enriched in DEGs.
  6. DEG item:Number of DEGs with GO annotation.
  7. DEG list:Number of all reference genes with GO annotation.


Novogene Co., Ltd


10.1.2 GO enrichment Bar Chart of DEGs

  Top 20 significantly enriched terms in GO enrichment analysis results is displayed as the bar graph below. All terms are shown when the total number is less than 20.

  • The x-axis shows GO term in the sub-level of the GO three main domains. The y-axis shows the number of the differential expressed genes annotated in this term. From left to right are the three main GO domains: biological_process, cell_composition and molecular_function.

10.1.3 GO Enrichment DAG Figure


  TopGO DAG (Directed Acyclic Graph, DAG) can visually display the enriched GO term of differential expression genes and its hierarchical relation. The branches represent hierarchical relationship and the function ranges become more and more specific from top to bottom. DAG of biological process, molecular function and cellular component are shown respectively in the report. Top 10 significantly enriched terms of the three classes are taken as the main nodes of DAG, which are shown as boxes along with the associated GO terms. The darkness represents degree of enrichment, darker color means higher enrichment degree.

  • Each node represents a GO term, and Top 10 GO terms are boxed. The darker the color is, the higher is the enrichment level of the term. The name and p-value of each term are present on the node.


Novogene Co., Ltd


10.2 KEGG Pathway Enrichment Analysis


10.2.1 KEGG Pathway Enrichment Analysis


  The interactions of multiple genes may be involved in certain biological functions. KEGG (Kyoto Encyclopedia of Genes and Genomes) is a collection of manually curated databases dealing with genomes, biological pathways, diseases, drugs, and chemical substances. KEGG is utilized for bioinformatics research and education, including data analysis in genomics, metagenomics, metabolomics and other omics studies. Pathway enrichment analysis identifies significantly enriched metabolic pathways or signal transduction pathways associated with differentially expressed genes compared with the whole genome background.

KEGG Enrichment List:

Term Database ID Sample number Background number P-Value Corrected P-Value
ko02060 Phosphotransferase system (PTS) 84|5134 734|79075 2.80e-06 0.0001163 84
ko02020 Two-component system 550|5134 7059|79075 1.93e-05 0.0006626 550
ko04141 Protein processing in endoplasmic reticulum 114|5134 1160|79075 3.53e-05 0.0010379 114
ko01100 Metabolic pathways 2755|5134 39785|79075 4.52e-05 0.0011645 2755

  1. Term:Description of KEGG pathways
  2. ID:KEGG ID.
  3. Sample Number:Number of DGEs with pathway anntation.
  4. Background Number:Number of all reference genes with pathway annotation.
  5. P-value:P-value in hypergeometric test.
  6. Corrected P-value:Pathways with corrected p-value < 0.05 are significantly enriched in DEGs.


Novogene Co., Ltd


10.2.2 KEGG Enrichment Scattered Plot

  KEGG enrichment scatter plot shows the DEGs enrichment analysis results in KEGG pathway. The degree of KEGG enrichment is measured by Rich factor, q-value and the number of genes enriched in this pathway. Rich factor refers to the ratio of the DEGs number in the pathway and the number of all genes annotated in the pathway. Q-value is the normalized p-value in range [0,1]. The smaller q-value is, the more significant the enrichment is. The top20 significantly DEGs enriched pathways are displayed in the report. If the enriched pathways are less than 20, all enriched pathways are displayed:

10.2.3 KEGG Enrichment Pathway


  KEGG enrichment pathway shows the DEGs significantly enriched pathways. In the diagram, if this node contains up-regulated genes, the KO node is labeled in red. If the node contains down-regulated genes, the KO node is labeled in green. If the node contains both up and down-regulated genes, the labeled color is yellow.

10.3 Enrichment Analysis Q&A


Q:What are the meaning and advantages of differentially expressed gene enrichment analysis?

A: Gene enrichment analysis is a method gathering genes into known biological function or pathway through bioinformatic database and statistical tools, which helps us deeply understand gene functions in in biological terms. The purpose is to screen out gene clusters of two or more comparison groups with different expresseion levels, this is the extension of single-gene study. This method includes GO enrichment analysis and KEGG enrichment analysis. The GO enrichment analysis help us understand gene function of differential expressed gene, while the KEGG enrichment analysis help us unde rstand the metabolic pathway that the genes participate in. The gene pathway can quickly set up the correlation network between a specific gene and its phenotype.

Gene enrichment analysis has following advantages compared to analysis based on single gene:
1 Higher sensitivity and lower rate of false postive.
2 Integrating the information of gene interaction, the results are more reliable 
3 Simutaneously dealing with large gene expression data with uniform classfication information.
4 Being reproductive. 

Q: What are the meaning of color and node in KEGG pathway figure

A: KEGG Pathway have five types :
map - Reference pathway
ko - Reference pathway (KO)
ec - Reference pathway (EC)
rn - Reference pathway (Reaction)
org - Organism-specific pathway map

Nodes and color: 

1 "map" pathway
Node represent a specific gene, enzyme coded by the gene and related reaction, click the node to get these informations
2 "ko/ec/rn" pathway
In ko pathway, the node only represent a specific gene; In ec pathway, the node represent the related enzyme;In rn pathway, node represent a particular reaction including the substrates and reaction type. (Background is blue).
3 "org" pathway:
Species specific pathway figure, same as map pathway. (The background of species related node is green)

Each pathway has a unique ID number, such as map00010, which is the access number in KEGG database. In project without genome reference, it can only be ko -reference pathway.


Novogene Co., Ltd


11 Appendix


11.1 Explanation on result file

the.format.of.compressed.file user.type method
compressed files in the fomat of *.tar.gz Unix/Linux/Mac user use tar -zxvf *.tar.gz command
compressed files in the fomat of *.tar.gz Windows user use uncompressed software such as WinRAR, 7-Zip et al
compressed files in the fomat of *.gz Unix/Linux/Mac user use gzip –d *.gz command
compressed files in the fomat of *.gz Windows user use uncompressed software such as WinRAR, 7-Zip et al
compressed files in the fomat of *.zip Unix/Linux/Mac user use unzip *.zip command
compressed files in the fomat of *.zip Windows user use uncompressed software such as WinRAR, 7-Zip et al

11.2 Explanation on how to operating file

file.type file.description Open.mode
*.fasta sequence file, in the fomat of fasta. in genaral, because sequence of gene or genome is large, we provide customer with sampled.fasta file(partial sequences of .fasta). It will be more convenient to check file fomat. unix/Linux/Mac user use less or more command can view sequences in the fomat of *.fasta.
*.fasta sequence file, in the fomat of fasta. in genaral, because sequence of gene or genome is large, we provide customer with sampled.fasta file(partial sequences of .fasta). It will be more convenient to check file fomat. windows user use editor Editplus/Notepad++ et al
*.fq/fastq reads sequence file, in the fomat of fasta. it is no easy to open because of big file. unix/Linux/Mac user use less or more command;
*.fq/fastq reads sequence file, in the fomat of fasta. it is no easy to open because of big file. windows user use editor Editplus/Notepad++ et al
.xls,.txt table result file; files are separated by(Tab) unix/Linux/Mac user use less or more command
.xls,.txt table result file; files are separated by(Tab) windows user use editor Editplus/Notepad++ et al, also can use Microsoft Excel to open.
*.png figure result files; bitmap unix/Linux/Mac user use display command to open.
*.png figure result files; bitmap unix/Linux/Mac user use display command to open. windows user use picture viewer to view, for example photoshop et al
*.pdf figure result files; vectorgraph, can be zoom in or out. it is very convenient for customer to view or edit. using Adobe Illustrator to edit,the picture can be used in paper for publishing. windows/Mac user use Adobe Reader/Foxit reader/web browser to open
*.pdf figure result files; vectorgraph, can be zoom in or out. it is very convenient for customer to view or edit. using Adobe Illustrator to edit,the picture can be used in paper for publishing. unix/Linux user use evince command to open.

11.3 Software List

analysis software version parameter remarks
QC fastp 0.23.1 qualified_quality_phred 5, unqualified_percent_limit 50, n_base_limit 15, min_trim_length 10, overlap_diff_limit 1, overlap_diff_percent_limit 10, Other parameters are default parameters Sample Quality Control
Assembly Trinity v2.6.6 min_kmer_cov:2, Other parameters are default parameters
Cluster CORSET version v1.09 default parameter hierarchical cluster based on the transcript’s reads and expression patterns
gene function annotation diamond v2.1.6 NR, CAZy, KEGG, eggNOG:e-value = 1e-5,–more-sensitive: NR, CAZy, KEGG, eggNOG database annotation
GO annotation blast2go b2g4pipe_v2.5 e-value = 1.0E-6 GO annotation
mapping RSEM v1.2.28 bowtie2 parameter mismatch 0 transcripts Assemblying by trinity align and Quantitative
differential analysis edgeR 3.28.0 padj<0.005 samples without bio-replicate using edgeR
differential analysis DESeq2 1.26.0 padj<0.05 samples with bio-relicate using DESeq2
GO enrich GOSeq, topGO 1.32.0, 2.32.0 Corrected P-Value<0.05
KEGG enrich KOBAS v3.0 Corrected P-Value<0.05

11.4 Methods

Methods in English: PDF


Novogene Co., Ltd


12 References


*1. Leimena M M, Ramiro-Garcia J, Davids M, et al. A comprehensive metatranscriptome analysis pipeline and its validation using human small intestine microbiota datasets[J]. BMC genomics, 2013, 14(1): 530.

*2. Li B, Dewey C. (2011). RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, doi:10.1186/1471-2105-12-323.

*3. Trapnell C, Williams B A, Pertea G, et al. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotech 28, 511–515. (FPKM)

*4. Wang L, Feng Z, Wang X, et al. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics 26, 136-138.

*5. Young M D, Wakefield M J, Smyth G K, et al. (2010).Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology, doi:10.1186/gb-2010-11-2-r14. (GOseq)

*6. Kanehisa M, Araki M, Goto S, et al. (2008). KEGG for linking genomes to life and the environment. Nucleic Acids research 36:D480-D484.

*7. Stewart F J, Ottesen E A, DeLong E F. Development and quantitative analyses of a universal rRNA-subtraction protocol for microbial metatranscriptomics[J]. The ISME journal, 2010, 4(7): 896-907.

*8. Tveit A, Urich T, Svenning M M. Metatranscriptomic analysis of Arctic peat soil microbiota[J]. Applied and environmental microbiology, 2014: AEM. 01030-14.

*9. Gosalbes M J, Durbán A, Pignatelli M, et al. Metatranscriptomic approach to analyze the functional human gut microbiota[J]. PloS one, 2011, 6(3): e17447.

*10. Mason O U, Hazen T C, Borglin S, et al. Metagenome, metatranscriptome and single-cell sequencing reveal microbial response to Deepwater Horizon oil spill[J]. The ISME journal, 2012, 6(9): 1715-1727.

*11. Booijink C C G M, Boekhorst J, Zoetendal E G, et al. Metatranscriptome analysis of the human fecal microbiota reveals subject-specific expression profiles, with genes encoding proteins involved in carbohydrate metabolism being dominantly expressed[J]. Applied and environmental microbiology, 2010, 76(16): 5533-5540.

*12. Huson, Daniel H., et al. “Integrative analysis of environmental sequences using MEGAN4.” Genome research21.9 (2011): 1552-1560.